This report explores the Prosper Loan data set, which contains 113,937 loans with 81 variables each, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

Univariate Plots Section

## [1] 113937     81
## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ Occupation                         : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ EmploymentStatus                   : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Info About Loans

Let’s look at the distribution of loans according to their terms in months.

## 
##    12    36    60 
##  1614 87778 24545
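
A frequency table like the one above can be produced with `table()`. The following is a minimal sketch on a toy data frame; the name `loans_toy` and its values are made up for illustration and are not part of the original analysis.

```r
# Toy stand-in for the loan data; the values are illustrative only.
loans_toy <- data.frame(Term = c(36, 36, 60, 12, 36, 60))

# Count the loans per term length (in months).
term_counts <- table(loans_toy$Term)
print(term_counts)
```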

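It is also interesting to see how large the loan amounts are (plotted with bin widths of $1000 and $5000). Loans under $5000 form a large group, although the median of $6500 shows that at least half of all loans are larger than that.

Loan amounts range from $1000 to $35000, with a median of $6500.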

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
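
The summary above and the two bin widths mentioned earlier can be reproduced in base R roughly as follows. This is only a sketch: the vector `amounts` holds hypothetical values, and `hist(..., plot = FALSE)` is used here to show the binning without drawing a figure, which may differ from how the original plots were made.

```r
# Hypothetical loan amounts in dollars; not taken from the data set.
amounts <- c(1000, 2500, 4000, 4000, 6500, 10000, 15000, 25000, 35000)

# Minimum, quartiles, median, mean and maximum.
print(summary(amounts))

# Bin the same values with two bin widths without drawing a plot.
bins_1k <- hist(amounts, breaks = seq(0, 35000, by = 1000), plot = FALSE)
bins_5k <- hist(amounts, breaks = seq(0, 35000, by = 5000), plot = FALSE)

# Counts per $5000 bin, labelled by the upper edge of each bin.
print(setNames(bins_5k$counts, bins_5k$breaks[-1]))
```

The wider bins smooth out spikes at round amounts, while the narrower bins make them visible, which is why looking at both can be useful.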

Applying a log scale to the histogram does not reveal any sign of normality.

The distribution of borrower annual rates shows that most rates lie between about 0.15 and 0.2, that is, between 15% and 20%.

Loans per year/month

The date column is stored as timestamp text (a factor in the structure listing above), so we extract years and months to create plots for the time series and seasonality. Most loans were taken out in 2013. The number of loans also increases over the course of the year, and most people take out a loan in October and December. It would be interesting to analyse the total amount of all loans per year and the average amount. The year 2014 was removed from the monthly view, as it was not complete when the dataset was created.

Average and total loan amounts per year: 2014 was most likely not over when the data was collected, as its total is not quite as high as that of 2013. Its average, however, is so far still higher than that of 2013.
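
The extraction of years and months described above could look roughly like this. The sketch assumes the timestamps are text in the `YYYY-MM-DD hh:mm:ss` shape seen in the structure listing; the toy values and the name `loans_toy` are illustrative and not taken from the report.

```r
# Toy timestamps in the same textual shape as LoanOriginationDate,
# stored as a factor as in the structure listing.
origination <- factor(c("2013-10-15 00:00:00",
                        "2013-12-02 00:00:00",
                        "2014-01-20 00:00:00"))

# Convert factor -> character -> Date, then pull out the parts.
dates <- as.Date(as.character(origination), format = "%Y-%m-%d")
loans_toy <- data.frame(
  Year  = as.integer(format(dates, "%Y")),
  Month = format(dates, "%m")  # kept as "01".."12" strings
)

# Loans per year, and loans per month with the incomplete year 2014 dropped.
print(table(loans_toy$Year))
print(table(loans_toy$Month[loans_toy$Year != 2014]))
```

Keeping `Month` as a two-digit string means it sorts correctly and can be treated as a categorical variable in plots and models.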

The bulk of the Prosper score distribution lies in the middle: most people get average scores.

Looking at the listing categories of the loans, more than 50% are used for debt consolidation.

Info About Borrowers

Most borrowers are employed, but they do not specify the type of their employment. It would be interesting to find out how long each employment status typically lasts. Many employment statuses were empty; these were changed to ‘Not available’ to clean the data.
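
Replacing the empty employment statuses with ‘Not available’ can be done by renaming the empty factor level. A minimal sketch, assuming the column is a factor whose empty level is `""` as in the structure listing; the vector `status` is toy data.

```r
# Toy factor with an empty level, mimicking EmploymentStatus.
status <- factor(c("Employed", "", "Full-time", "Not available", ""))

# Rename the empty level; it merges with the existing "Not available" level.
levels(status)[levels(status) == ""] <- "Not available"

print(table(status))
```

Renaming the level rather than overwriting individual values avoids the invalid-level warning that assigning a new string into a factor can trigger.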

The income ranges are not ordered, which needs to be fixed. There is also a group labelled ‘Not employed’. One could merge this group with the $0 group, but it is unclear whether that is justified, as people who are not employed may have other sources of income (e.g. stocks, rental income, …).

## 
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

Looking at the distribution of income ranges, most people earn between $25,000 and $49,999. The second-largest group earns between $50,000 and $74,999. It would be interesting to see how the income ranges are composed in terms of employment status, which we will examine in the multivariate part of the analysis.
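
The ordering problem noted for the income ranges can be fixed by re-creating the factor with explicit levels. The following is a sketch on toy data that reuses the labels from the table above; where to place the two non-numeric labels is a judgment call made here for illustration, not necessarily the choice made in the report.

```r
# Toy income ranges using the labels from the table above.
income <- factor(c("$100,000+", "$0", "$25,000-49,999", "$1-24,999",
                   "$75,000-99,999", "$50,000-74,999", "Not employed",
                   "Not displayed"))

# By default the levels are sorted as text, not by income.
print(levels(income))

# Re-create the factor with an explicit, meaningful order.
income_levels <- c("Not displayed", "Not employed", "$0", "$1-24,999",
                   "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                   "$100,000+")
income <- factor(income, levels = income_levels, ordered = TRUE)

print(levels(income))
```

With an ordered factor, plots and tables list the ranges from lowest to highest, and comparisons such as `income > "$50,000-74,999"` become meaningful.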

Open vs Current Credit lines

Open and current credit lines both have right-skewed distributions.

Home ownership is split roughly 50-50, with slightly more homeowners.

## False  True 
## 56459 57478
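
The split can be expressed as shares with a short calculation. The counts are copied from the table above; the variable names are illustrative.

```r
# Counts copied from the table above.
homeowner <- c(False = 56459, True = 57478)

# Convert the counts to shares of all loans.
shares <- homeowner / sum(homeowner)
print(round(shares, 3))
```

This gives about 49.6% non-homeowners and 50.4% homeowners.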

Most borrowers have high bank card utilisation. A utilisation above 1.0 would mean that the borrower has used more than the available credit limit on the card.

Most people take loans as individuals.

Univariate Analysis

What is the structure of your dataset?

The dataset contains 113937 observations of 81 features.

What is/are the main feature(s) of interest in your dataset?

BorrowerAPR

BorrowerRate

LenderYield

ProsperScore

EmploymentStatus

EmploymentStatusDuration

BankcardUtilization

IncomeRange

LoanOriginalAmount

LoanOriginationDate

Term

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

ListingCategory

IsBorrowerHomeowner

CurrentlyInGroup

CurrentDelinquencies

DelinquenciesLast7Years

PublicRecordsLast12Months

PublicRecordsLast10Years

AvailableBankcardCredit

CurrentCreditLines

OpenCreditLines

Did you create any new variables from existing variables in the dataset?

Year

Month

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Employment Status contained empty values, which were changed to “Not available” to match the existing level. Income Range was unordered, so its levels were reordered and the $100k+ range was moved to its proper position. The date column could not be used directly, so Year and Month were extracted into separate columns. Listing categories were converted from numeric codes to strings to make the categories easier to read in the plots.

Bivariate Plots Section

The distribution for “Not employed” is much steeper than those for “Employed” and, for example, “Full-time”, which suggests that most people are fortunately unemployed only for a relatively short time. Part-time employment also does not seem to last as long as full-time employment, with the bulk of its distribution closer to the right.

The borrower annual rate decreases as the Prosper score increases. The correlation is moderate and negative, at about -0.67.

## 
##  Pearson's product-moment correlation
## 
## data:  df$ProsperScore and df$BorrowerAPR
## t = -261.68, df = 84851, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6719940 -0.6645469
## sample estimates:
##        cor 
## -0.6682872

The borrower rate (unsurprisingly) has a highly linear relationship with the lender yield. However, some values do not lie on the line. We will analyse this in the multivariate analysis.

## 
##  Pearson's product-moment correlation
## 
## data:  df$BorrowerRate and df$LenderYield
## t = 8493.9, df = 113940, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9992021 0.9992204
## sample estimates:
##       cor 
## 0.9992113

There is a weak negative correlation (about -0.32) between loan amount and APR.

## 
##  Pearson's product-moment correlation
## 
## data:  df$BorrowerAPR and df$LoanOriginalAmount
## t = -115.14, df = 113910, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3280787 -0.3176752
## sample estimates:
##        cor 
## -0.3228867

Mean APR had its highest point in 2011 and then decreased again.

## 
##  Pearson's product-moment correlation
## 
## data:  df$Year and df$BorrowerAPR
## t = 21.946, df = 113910, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.05910109 0.07066652
## sample estimates:
##        cor 
## 0.06488598

Most borrowers have fewer than 50 credit lines and fewer than 25 delinquencies in the past 7 years.

People with higher bank card utilisation tend to have higher loan rates.

## 
##  Pearson's product-moment correlation
## 
## data:  df$BankcardUtilization and df$BorrowerAPR
## t = 88.323, df = 106330, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2558295 0.2670290
## sample estimates:
##      cor 
## 0.261438

People with more income tend to borrow higher amounts. Only people earning over 100k got loans higher than 25k.

Borrowers with higher income also get better annual rates. The low rates for people with an income of $0 could be explained by many students in this group who get cheap student loans.

The Prosper score depends on income. For incomes below 50k, the average score stays the same. It is also the same across the range from 50k to 75k. The average is highest above 100k.

On a log scale, delinquencies become easier to see. However, they do not vary visibly among income ranges.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

It turns out that people with higher income on average get higher amounts and better rates for loans.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Loan rates increased until 2011 and then decreased again. Maybe this is a consequence of the financial crisis around 2008.

What was the strongest relationship you found?

BorrowerRate, LenderYield -> 0.99

BorrowerAPR, ProsperScore -> -0.67

BorrowerAPR, LoanOriginalAmount -> -0.32

Multivariate Plots Section

Because “Not employed” and “$0” are not equivalent, the employment statuses were analysed to see how the different groups are composed.

Borrower rate is strictly linear to the Lender Yield. However, if there is no score available, the lender yield seems to deviate from the ideal margin line as can be seen in the following two plots. The upper one has the NA Prosper Score values removed while the lower one incorporates them as grey dots.

Relationship between BorrowerAPR, Loan Amount and Score shows that APR slightly decreases when amount increases and when score increases.

Long-term loans (60 months) tend to have higher amounts than shorter loans (36 months). Short-term loans (12 months) tend to have the lowest amounts.

When comparing Income Ranges in terms of loan amounts, it can be seen that the bulks of the high income ranges are more to the right side (higher loans) and the low income ranges on the left side (lower loans).

Bank card utilisation per income range shows that higher income ranges have a larger bulk at high credit card utilisation rates.

The mean loan amount per year and income range again clearly shows that higher income ranges get higher loans. There is also no value for $0 in 2014. This group is not very realistic, which may be why it was dropped. “Not displayed” shows a lack of data before 2007.

Relationship between borrower rate, term and APR

The different terms form clusters in this plot based on their rate/APR ratio.

Lower income is connected to lower employment duration.

Lower income ranges have much higher APRs. After 2012 they even increased, while for higher income ranges APRs decreased and ceased to exist after 2013.

Linear Model for score
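
The output that follows compares a sequence of nested linear models for BorrowerAPR, each adding one predictor to the previous one. A minimal base R sketch of building such a sequence is given here on synthetic data; the data frame `toy` and its coefficients are arbitrary, and the comparison via R-squared and `anova()` is one possible approach rather than the one used to print the table (which resembles `memisc::mtable` output, though that is an assumption).

```r
# Synthetic stand-in for the loan data; the coefficients are arbitrary.
set.seed(1)
n <- 300
toy <- data.frame(
  ProsperScore       = sample(1:11, n, replace = TRUE),
  LoanOriginalAmount = sample(seq(1000, 35000, by = 500), n, replace = TRUE)
)
toy$BorrowerAPR <- 0.36 - 0.02 * toy$ProsperScore -
  1e-6 * toy$LoanOriginalAmount + rnorm(n, sd = 0.04)

# Start with one predictor and add one term per step.
m1 <- lm(BorrowerAPR ~ ProsperScore, data = toy)
m2 <- update(m1, . ~ . + LoanOriginalAmount)

# Compare the fit of the nested models.
r_squared <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r_squared, 3))
print(anova(m1, m2))
```

Using `update()` keeps each model a strict extension of the previous one, so the change in fit can be attributed to the newly added term.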

## 
## Calls:
## m1: lm(formula = BorrowerAPR ~ ProsperScore, data = df)
## m2: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount, 
##     data = df)
## m3: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year, data = df)
## m4: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus, data = df)
## m5: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization, data = df)
## m6: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization + IncomeRange, 
##     data = df)
## m7: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization + IncomeRange + 
##     Month, data = df)
## m8: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization + IncomeRange + 
##     Month + Term, data = df)
## m10: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization + IncomeRange + 
##     Month + Term + CurrentDelinquencies + DebtToIncomeRatio, 
##     data = df)
## m11: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization + IncomeRange + 
##     Month + Term + CurrentDelinquencies + DebtToIncomeRatio + 
##     CurrentlyInGroup, data = df)
## m12: lm(formula = BorrowerAPR ~ ProsperScore + LoanOriginalAmount + 
##     Year + EmploymentStatus + BankcardUtilization + IncomeRange + 
##     Month + Term + CurrentDelinquencies + DebtToIncomeRatio + 
##     CurrentlyInGroup + IsBorrowerHomeowner, data = df)
## 
## ============================================================================================================================================================================================================================
##                                                   m1              m2              m3              m4              m5              m6              m7              m8             m10             m11             m12        
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                     0.360***        0.377***       42.134***       52.814***       54.260***       54.059***       56.838***       58.612***       57.799***       57.906***       58.707***  
##                                                  (0.001)         (0.001)         (0.325)         (0.369)         (0.363)         (0.363)         (0.396)         (0.397)         (0.418)         (0.419)         (0.416)    
##   ProsperScore                                   -0.022***       -0.020***       -0.023***       -0.023***       -0.022***       -0.022***       -0.022***       -0.022***       -0.021***       -0.021***       -0.021***  
##                                                  (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   LoanOriginalAmount                                             -0.000***       -0.000***       -0.000***       -0.000***       -0.000***       -0.000***       -0.000***       -0.000***       -0.000***       -0.000***  
##                                                                  (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   Year                                                                           -0.021***       -0.026***       -0.027***       -0.027***       -0.028***       -0.029***       -0.029***       -0.029***       -0.029***  
##                                                                                  (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   EmploymentStatus: Full-time/Employed                                                           -0.040***       -0.042***       -0.042***       -0.045***       -0.045***       -0.044***       -0.044***       -0.043***  
##                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   EmploymentStatus: Not employed/Employed                                                         0.026***        0.028***       -0.002          -0.002          -0.002          -0.043          -0.038          -0.029     
##                                                                                                  (0.002)         (0.002)         (0.008)         (0.007)         (0.007)         (0.048)         (0.048)         (0.047)    
##   EmploymentStatus: Other/Employed                                                                0.002*          0.003***       -0.000           0.000           0.001          -0.000          -0.001           0.001     
##                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   EmploymentStatus: Part-time/Employed                                                           -0.041***       -0.041***       -0.050***       -0.053***       -0.053***       -0.050***       -0.050***       -0.050***  
##                                                                                                  (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)    
##   EmploymentStatus: Retired/Employed                                                             -0.030***       -0.030***       -0.034***       -0.036***       -0.035***       -0.036***       -0.035***       -0.034***  
##                                                                                                  (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)    
##   EmploymentStatus: Self-employed/Employed                                                       -0.021***       -0.018***       -0.018***       -0.018***       -0.017***       -0.029***       -0.029***       -0.029***  
##                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.007)         (0.007)         (0.007)    
##   BankcardUtilization                                                                                             0.032***        0.034***        0.035***        0.034***        0.036***        0.036***        0.038***  
##                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   IncomeRange: $1-24,999/$0                                                                                                      -0.016*         -0.016*         -0.016*                                                    
##                                                                                                                                  (0.007)         (0.007)         (0.007)                                                    
##   IncomeRange: $100,000+/$0                                                                                                      -0.032***       -0.031***       -0.030***       -0.012***       -0.012***       -0.006***  
##                                                                                                                                  (0.007)         (0.007)         (0.007)         (0.001)         (0.001)         (0.001)    
##   IncomeRange: $25,000-49,999/$0                                                                                                 -0.028***       -0.027***       -0.028***       -0.010***       -0.010***       -0.008***  
##                                                                                                                                  (0.007)         (0.007)         (0.007)         (0.001)         (0.001)         (0.001)    
##   IncomeRange: $50,000-74,999/$0                                                                                                 -0.033***       -0.032***       -0.032***       -0.014***       -0.014***       -0.010***  
##                                                                                                                                  (0.007)         (0.007)         (0.007)         (0.001)         (0.001)         (0.001)    
##   IncomeRange: $75,000-99,999/$0                                                                                                 -0.033***       -0.032***       -0.032***       -0.014***       -0.014***       -0.009***  
##                                                                                                                                  (0.007)         (0.007)         (0.007)         (0.001)         (0.001)         (0.001)    
##   Month: 02/01                                                                                                                                   -0.003***       -0.003***       -0.003***       -0.003***       -0.003***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 03/01                                                                                                                                   -0.003***       -0.004***       -0.004***       -0.004***       -0.004***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 04/01                                                                                                                                   -0.001          -0.003***       -0.004***       -0.004***       -0.004***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 05/01                                                                                                                                   -0.004***       -0.006***       -0.007***       -0.007***       -0.007***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 06/01                                                                                                                                   -0.004***       -0.007***       -0.008***       -0.008***       -0.008***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 07/01                                                                                                                                   -0.004***       -0.007***       -0.007***       -0.007***       -0.007***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 08/01                                                                                                                                   -0.007***       -0.009***       -0.010***       -0.010***       -0.010***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 09/01                                                                                                                                   -0.005***       -0.006***       -0.007***       -0.007***       -0.007***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 10/01                                                                                                                                   -0.008***       -0.009***       -0.009***       -0.009***       -0.009***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 11/01                                                                                                                                   -0.012***       -0.013***       -0.013***       -0.013***       -0.013***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Month: 12/01                                                                                                                                   -0.018***       -0.019***       -0.018***       -0.019***       -0.019***  
##                                                                                                                                                  (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   Term                                                                                                                                                            0.001***        0.001***        0.001***        0.001***  
##                                                                                                                                                                  (0.000)         (0.000)         (0.000)         (0.000)    
##   CurrentDelinquencies                                                                                                                                                            0.005***        0.005***        0.005***  
##                                                                                                                                                                                  (0.000)         (0.000)         (0.000)    
##   DebtToIncomeRatio                                                                                                                                                               0.007***        0.007***        0.008***  
##                                                                                                                                                                                  (0.001)         (0.001)         (0.001)    
##   CurrentlyInGroup: True/False                                                                                                                                                                   -0.006***       -0.006***  
##                                                                                                                                                                                                  (0.001)         (0.001)    
##   IsBorrowerHomeowner: True/False                                                                                                                                                                                -0.012***  
##                                                                                                                                                                                                                  (0.000)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                       0.447           0.513           0.592           0.612           0.625           0.627           0.631           0.637           0.639           0.639           0.645     
##   adj. R-squared                                  0.447           0.513           0.592           0.612           0.625           0.627           0.631           0.636           0.639           0.639           0.645     
##   sigma                                           0.059           0.056           0.051           0.050           0.049           0.049           0.049           0.048           0.048           0.048           0.047     
##   F                                           68477.863       44693.580       41078.950       14843.372       14157.563        9519.616        5584.057        5502.654        4904.649        4737.888        4690.494     
##   p                                               0.000           0.000           0.000           0.000           0.000           0.000           0.000           0.000           0.000           0.000           0.000     
##   Log-likelihood                             119107.488      124531.466      132064.086      134126.052      135649.676      135878.977      136325.808      136946.168      126186.902      126199.599      126793.657     
##   Deviance                                      299.890         263.900         220.970         210.487         203.062         201.968         199.852         196.951         175.361         175.304         172.639     
##   AIC                                       -238208.976     -249054.931     -264118.171     -268230.104     -271275.352     -271723.954     -272595.616     -273834.336     -252313.804     -252337.197     -253523.315     
##   BIC                                       -238180.930     -249017.537     -264071.428     -268127.269     -271163.168     -271565.027     -272333.853     -273563.224     -252036.041     -252050.176     -253227.034     
##   N                                           84853           84853           84853           84853           84853           84853           84853           84853           77557           77557           77557         
## ============================================================================================================================================================================================================================

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The multivariate analyses make it even clearer that borrowers with higher incomes take out larger loans and get lower rates. The plots also visualise how the rates and amounts changed over the years. Credit card utilisation patterns may also vary between income groups.

Were there any interesting or surprising interactions between features?

Prosper did not differentiate between income ranges before 2007, and as of 2013 the data set no longer contains any borrowers with zero income or unemployed status (maybe because Prosper did not allow any loans for this group).

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

The multiple linear regression model takes 12 different variables into account and achieves an R^2 of 0.645 for predicting BorrowerAPR. It only captures linear relationships between the variables, so its performance could be improved by looking at non-linear relations. More advanced feature engineering, further cleaning of the data set, and additional variables could also improve performance. Note that the last three models in the table are fitted on 77,557 rather than 84,853 observations, so their log-likelihood, AIC and BIC values are not directly comparable with those of the earlier models.
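The point about non-linear relations can be illustrated with a minimal sketch. It does not use the Prosper data: the data frame `toy` and its columns `apr` and `score` are synthetic stand-ins invented for this example, and the quadratic term is just one possible non-linear extension.

```r
# Synthetic stand-in data (assumption); the real analysis would use the Prosper data frame.
set.seed(42)
n     <- 500
score <- runif(n, 1, 11)
apr   <- 0.35 - 0.04 * score + 0.002 * score^2 + rnorm(n, sd = 0.01)
toy   <- data.frame(apr = apr, score = score)

# Linear-only model versus the same model with an added quadratic term.
m_linear    <- lm(apr ~ score, data = toy)
m_quadratic <- lm(apr ~ score + I(score^2), data = toy)

r2_linear    <- summary(m_linear)$r.squared
r2_quadratic <- summary(m_quadratic)$r.squared

# anova() tests whether the extra term significantly reduces the residual error.
comparison <- anova(m_linear, m_quadratic)

print(c(linear = r2_linear, quadratic = r2_quadratic))
print(comparison)
```

Because the two models are nested, the R^2 of the larger model can never be lower than that of the smaller one; the F-test from `anova()` indicates whether the improvement is more than noise.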


Final Plots and Summary

Plot One

Description One

Most loans are taken out towards the end of the year. This is quite interesting and could have several explanations. The most intuitive one is that people have plans for the following year and therefore borrow money. Another possibility is that people run out of money towards the end of the year and need to borrow.
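A count of loans per calendar month could be computed roughly as follows. This is a sketch on four made-up timestamps written in the same string format as ListingCreationDate; which date column the plot was actually based on is not stated here, so that choice is an assumption.

```r
# Toy timestamps (assumption), formatted like the ListingCreationDate strings.
dates <- c("2012-01-15 10:00:00.000000000",
           "2012-11-03 09:30:00.000000000",
           "2012-12-20 14:45:00.000000000",
           "2013-12-01 08:15:00.000000000")

# Keep the "YYYY-MM-DD" prefix, parse it, and extract the two-digit month.
parsed <- as.Date(substr(dates, 1, 10), format = "%Y-%m-%d")
month  <- factor(format(parsed, "%m"), levels = sprintf("%02d", 1:12))

# Count loans per calendar month; months without loans are reported as 0.
loans_per_month <- table(month)
print(loans_per_month)
```

Fixing the factor levels to all twelve months keeps empty months in the table, which makes the seasonal pattern easier to read.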

Plot Two

Description Two

The BorrowerAPR depends strongly on the income range. Higher income makes a better rate more likely, while lower income increases the chance of a worse rate. The exception is the $0 income group, which could include students who get cheaper student loans.

Plot Three

Description Three

This plot shows how the APR developed over time for each of the borrowers’ income ranges. The overall rate increases until 2011 and subsequently drops, except for low earners and unemployed borrowers, who no longer appear in the data set as of 2013.
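The summary behind such a plot could be prepared along these lines. The sketch assumes a `year` column has already been derived from a date column; the five rows in `toy` are invented, and the median is only one possible choice of summary statistic.

```r
# Toy data (assumption): a pre-computed year, an income range and the APR.
toy <- data.frame(
  year        = c(2010, 2010, 2011, 2011, 2011),
  IncomeRange = c("$25,000-49,999", "$100,000+",
                  "$25,000-49,999", "$25,000-49,999", "$100,000+"),
  BorrowerAPR = c(0.30, 0.15, 0.32, 0.34, 0.17)
)

# Median APR for every combination of year and income range.
apr_by_group <- aggregate(BorrowerAPR ~ year + IncomeRange, data = toy, FUN = median)
print(apr_by_group)
```

Combinations with no observations simply do not appear in the result, which is how a group that drops out of the data in later years would show up.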


Reflection

This data analysis explored the behaviour of borrowers based on various features. Insight was gained especially into the borrowers’ employment statuses, income ranges, credit card utilisation and use of loans. Finally, a model was generated to predict borrowers’ rates based on 12 of their characteristics.

The dataset was challenging to some extent. As it includes approx. 80 variables, a lot of work was initially put into understanding the dataset and looking at different variables. Running ggpairs on a short list of features was especially helpful, and consulting the spreadsheet also helped to get a better grasp of the different features.

Additionally, some of the struggles involved features that were not internally consistent, e.g. income range, which included “Not employed” even though this should have been part of “Employment Status” only. Also, the values of “Employment Status” were not all at the same level of detail. While “Employed” and “Unemployed” are opposites, “Full-time” and “Part-time” are subsets of “Employed” and cannot be directly compared with it.
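One way to make the employment categories comparable would be to collapse the finer-grained labels into their parent category. The mapping below is a hypothetical illustration covering only the four labels named above; it was not part of the analysis itself.

```r
# Toy vector of status labels (assumption), using only the values named in the text.
status <- c("Employed", "Full-time", "Part-time", "Unemployed", "Full-time")

# Hypothetical mapping: treat the finer-grained labels as subsets of "Employed".
coarse_map <- c("Employed"   = "Employed",
                "Full-time"  = "Employed",
                "Part-time"  = "Employed",
                "Unemployed" = "Unemployed")

# Look up each label by name; unname() drops the lookup keys from the result.
status_coarse <- unname(coarse_map[status])
print(table(status_coarse))
```

Any label missing from the mapping would come back as NA, which makes unmapped categories easy to spot before modelling.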

On the other hand, it was surprising to see how the different features influence the BorrowerAPR (e.g. Score and Income Range). It was also surprising that low-income and unemployed borrowers ceased to exist in the data set as of 2013.

Further cleaning, feature engineering and exploration of more variables could yield further insights and more powerful models. For example, it could be explored which impact the estimated values (e.g. EstimatedLoss, EstimatedReturn, EstimatedEffectiveYield …) have on the model, or what impact variables such as Occupation or CurrentlyInGroup have. Also, it could be analysed whether the performance of the model changes when the pre-2007 values, which do not have different income ranges incorporated, are removed. Further features could be engineered and used for prediction, e.g. ratios between features such as Credit Lines by Delinquencies. Finally, it could be analysed whether non-linear assumptions or higher-order functions of features generate better model results.
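Two of these ideas, a ratio feature and the removal of pre-2007 records, might look like the following sketch. The three rows in `toy` are invented, the column name `CurrentCreditLines` is assumed, and the ratio is computed as delinquencies per credit line (the inverse of the ratio mentioned above) purely to keep the division well defined when a borrower has no delinquencies.

```r
# Toy rows (assumption) with a listing date and two credit-history counts.
toy <- data.frame(
  ListingCreationDate  = c("2006-03-01 00:00:00",
                           "2008-07-15 00:00:00",
                           "2013-02-10 00:00:00"),
  CurrentCreditLines   = c(4, 0, 10),
  CurrentDelinquencies = c(1, 0, 2),
  stringsAsFactors     = FALSE
)

# Ratio feature: delinquencies per open credit line; NA when there are no lines.
toy$DelinqPerLine <- ifelse(toy$CurrentCreditLines > 0,
                            toy$CurrentDelinquencies / toy$CurrentCreditLines,
                            NA_real_)

# Keep only listings created in 2007 or later.
listing_year <- as.integer(substr(toy$ListingCreationDate, 1, 4))
post_2007    <- toy[listing_year >= 2007, ]
print(post_2007)
```

Refitting the model on `post_2007` and comparing its fit statistics with the full-data model would show whether the undifferentiated early records affect performance.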